Abstract

In this paper we consider stochastic multiarmed bandit problems. Recently a policy, DMED, was
proposed and proved to achieve the asymptotic bound for the model in which each reward
distribution is supported in a known bounded interval, e.g. [0,1]. However, the derived regret
bound is described in an asymptotic form and the performance in finite time has been unknown.
We inspect this policy and derive a finite-time regret bound by refining large deviation
probabilities to a simple finite form. Further, this observation reveals that the assumption on
the lower-boundedness of the support is not essential and can be replaced with a weaker one,
the existence of the moment generating function.

1 Introduction 

In the multiarmed bandit problem a gambler pulls arms of a slot machine sequentially so that the
total reward is maximized. There is a tradeoff between exploration and exploitation since he cannot
know the most profitable arm unless pulling all arms infinitely many times.

There are two main formulations for this problem: stochastic and nonstochastic bandits. In the
stochastic setting the rewards of each arm follow an unknown distribution (Gittins, 1989; Agrawal, 1995;
Vermorel and Mohri, 2005), whereas the rewards are determined by an adversary in the nonstochastic
setting (Auer et al., 2002b). In this paper we consider the stochastic bandit, where the rewards of arm
$i \in \{1,\dots,K\}$ are an i.i.d. sequence from an unknown distribution $F_i \in \mathcal{F}$ with
expectation $\mu_i$ for a model $\mathcal{F}$ known to the gambler. For the maximum expectation
$\mu^* = \max_i \mu_i$, we call an arm $i$ optimal if $\mu_i = \mu^*$ and suboptimal otherwise. If the
gambler knows each $\mu_i$ beforehand, it is best to choose optimal arms at every round. A policy is a
strategy of the gambler for choosing arms based on the past results of plays. The performance of a
policy is measured by the loss called expected regret or regret, in short, given by



$$\sum_{i:\mu_i<\mu^*} (\mu^*-\mu_i)\,\mathrm{E}[T_i(n)] ,$$

where $T_i(n)$ is the number of plays of arm $i$ through the first $n$ rounds. Since we regard each
$\mu_i$ as an unknown constant fixed in advance, we consider how we can reduce $\mathrm{E}[T_i(n)]$ for
each suboptimal arm $i$ to achieve a small regret.
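The regret above is a weighted sum of the expected pull counts of the suboptimal arms. A minimal sketch of this formula follows; the function name and the numbers in the usage line are hypothetical and only illustrate the definition.

```python
def expected_regret(means, expected_pulls):
    """Sum over suboptimal arms of (mu* - mu_i) * E[T_i(n)].

    `means` holds the expectations mu_i and `expected_pulls` the values
    E[T_i(n)]; both are illustrative inputs, not data from the paper.
    """
    best = max(means)
    return sum((best - m) * t for m, t in zip(means, expected_pulls) if m < best)


# Two optimal arms (mean 0.9) and one suboptimal arm (mean 0.5) pulled 10 times.
example = expected_regret([0.9, 0.5, 0.9], [80, 10, 10])
```

Only the suboptimal arm contributes, so the example evaluates to roughly (0.9 - 0.5) * 10.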

Robbins (1952) first considered this setting and Lai and Robbins (1985) gave a framework for
determining an optimal policy by establishing a theoretical bound for the regret. Later this
theoretical bound was extended to multiparameter or nonparametric models $\mathcal{F}$ by Burnetas and
Katehakis (1996). In their paper, it was proved that any policy satisfying a mild regularity condition
satisfies

$$\mathrm{E}[T_i(n)] \ge \frac{1-o(1)}{D_{\inf}(F_i,\mu^*;\mathcal{F})}\,\log n , \qquad (1)$$

where $D_{\inf}(F,\mu;\mathcal{F})$ is defined in terms of Kullback-Leibler divergence $D(\cdot\|\cdot)$ by

$$D_{\inf}(F,\mu;\mathcal{F}) = \inf_{G\in\mathcal{F}:\,\mathrm{E}(G)>\mu} D(F\|G) .$$

The most popular model in the nonparametric setting is the family of distributions with supports
contained in a known bounded interval, e.g. [0, 1]. For this model, which we denote by $\mathcal{A}_0$,
it is known that fine performance can be obtained by policies called Upper Confidence Bound
(UCB) (Auer et al., 2002a; Audibert et al., 2009; Garivier and Cappé, 2011). However, although
some bounds for regrets of UCB policies have been obtained in a non-asymptotic form, they do not
necessarily achieve the asymptotic theoretical bound.

Recently Honda and Takemura (2010) proposed the Deterministic Minimum Empirical Divergence
(DMED) policy, which chooses arms based on an index $D_{\inf}(\hat F_i,\hat\mu^*;\mathcal{A}_0)$, or
simply written as $D_{\inf}(\hat F_i,\hat\mu^*)$, for the empirical distribution $\hat F_i$ of arm $i$.
Whereas DMED achieves the theoretical bound asymptotically, the evaluation heavily depends on an
asymptotic analysis and any finite-time regret bound has been unknown. Further, in the analysis of
DMED, the assumption on the lower bound of the support seems to be a technical one needed for the
proof. For example, the gambler does not have to know that the lower bound of the support is zero if
he knows that the upper bound is one.

Our Contribution. Based on the above observation, we consider the family $\mathcal{A}$ of distributions
on $(-\infty,1]$ instead of $\mathcal{A}_0$. We first show that
$D_{\inf}(F,\mu;\mathcal{A}_0) = D_{\inf}(F,\mu;\mathcal{A})$ for all $F\in\mathcal{A}_0$. Thus,
although the gambler has more candidates for the true distribution of each arm in the model
$\mathcal{A}$ than in $\mathcal{A}_0$, the theoretical bound (1) does not vary between $\mathcal{A}_0$
and $\mathcal{A}$.

Next we provide a finite-time regret bound of DMED for all distributions in $\mathcal{A}$ with moment
generating functions existing in some neighborhood of the origin. Since nonstochastic bandits
inevitably require the boundedness of the support, we can now assert that an advantage of assuming
stochastic bandits is that the semi-bounded rewards can be dealt with in the nonparametric setting.

Technical Approach. In the evaluation of DMED it is essential to evaluate the probability that
$D_{\inf}(\hat F_i,\mu)$ deviates from $D_{\inf}(F_i,\mu)$. Note that for policies based on the index
$D_{\inf}(\hat F_i,\mu)$, finite-time regret bounds have been derived for the case that each
distribution is supported in a finite subset of [0, 1] (Maillard et al., 2011; Honda and Takemura,
2011). The advantage of assuming finiteness is that Sanov's theorem gives a non-asymptotic large
deviation probability. However, the regret bounds derived by this technique contain a finite but
exceedingly large term

$$\sum_{t=1}^{\infty} t^{|\mathrm{supp}(F_i)|}\, e^{-at} ,$$

where $|\mathrm{supp}(F_i)|$ denotes the size of the support of $F_i$ and the polynomial
$t^{|\mathrm{supp}(F_i)|}$ appears as the total number of possible empirical distributions from $t$
samples from $F_i$. Similarly, whereas a non-asymptotic Sanov's theorem is also known for continuous
support distributions (see Dembo and Zeitouni (1998, Ex. 6.2.19)), it requires the total number of
$\epsilon$-balls needed to cover a set of distributions as a coefficient. Thus, although it is not
impossible to derive a finite-time regret bound by a naive application of the non-asymptotic Sanov's
theorem, it becomes very complicated and unrealistic.
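To make the polynomial factor concrete, the number of distinct empirical distributions of $t$ samples on a support of $m$ points can be counted with the standard stars-and-bars formula. The sketch below is only an illustration; the function name is hypothetical, and the bound $(t+1)^m$ is the usual crude polynomial bound rather than the exact coefficient used in the cited works.

```python
from math import comb


def num_empirical_distributions(t, support_size):
    """Number of ways to distribute t samples over `support_size` points
    (stars and bars), i.e. the number of possible empirical distributions."""
    return comb(t + support_size - 1, support_size - 1)


count = num_empirical_distributions(10, 3)    # 10 samples, 3 support points
crude_polynomial_bound = (10 + 1) ** 3        # grows polynomially in t
```

The count is polynomial in $t$ with degree growing in the support size, which is why the resulting constant term becomes very large.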

To avoid counting or covering the possible empirical distributions, we exploit the following fact

$$D_{\inf}(\hat F_i,\mu) = \max_{0\le\nu\le\frac{1}{1-\mu}} \mathrm{E}_{\hat F_i}[\log(1-(X-\mu)\nu)] . \qquad (2)$$

Although it involves a maximization operation, it is merely an empirical mean of random variables
$\log(1-(X_t-\mu)\nu)$ where each $X_t$ follows distribution $F_i$. By Cramér's theorem we can bound the
large deviation probability for such a finite dimensional empirical mean by an exponential function
with a simple coefficient.

Another difficulty for our setting is that $D_{\inf}(F,\mu) = D_{\inf}(F,\mu;\mathcal{A})$ is neither
bounded nor continuous in $F\in\mathcal{A}$ unlike the case of $\mathcal{A}_0$, which makes the
evaluation of the exponential rate for the large deviation probability of $D_{\inf}(\hat F_i,\mu)$ much
harder. The key to this problem also lies in (2). Since it is an expectation of a logarithmic function
of $X$, the effect of the tail weight is weaker than that of the polynomial function $X\mapsto X$. Thus
the large deviation probability of the joint distribution of
$(D_{\inf}(\hat F_i,\mu),\,\mathrm{E}_{\hat F_i}[X])$ can be evaluated under the same regularity
condition as that for the empirical mean $\mathrm{E}_{\hat F_i}[X]$ alone, namely, the existence of the
moment generating function of $F_i$ in some neighborhood of the origin.

Paper Outline. In Sect. 2 we give definitions used throughout this paper and introduce the DMED
policy proposed for distributions on [0,1]. In Sect. 3 we give the main results of this paper on
the finite-time regret bound of DMED for distributions on $(-\infty,1]$. The remaining sections are
devoted to the proof of the main results. We extend some results for the support [0, 1] to
$(-\infty,1]$ in Sect. 4. We derive a large deviation probability for $D_{\inf}(\hat F,\mu)$ in a
non-asymptotic form in Sect. 5. We conclude this paper in Sect. 6. We give some results on the large
deviation principle in Appendix A. A proof of the main theorem is given in Appendix B.



Algorithm 1 DMED Policy
Parameter: $r \in (0,1)$.

Initialization: $L_C, L_R := \{1,\dots,K\}$, $L_N := \emptyset$, $n := K$. Pull each arm once.
Loop:

1. For $i \in L_C$ in ascending order,

1.1. $n := n+1$ and pull arm $i$. $L_R := L_R \setminus \{i\}$.

1.2. $L_N := L_N \cup \{j\}$ for all $j \notin L_R$ such that the following event $J'_n(j)$ occurs:

$$J'_n(j) = \{(1-r)\,T_j(n)\,D_{\inf}(\hat F_j(n),\hat\mu^*(n);\mathcal{A}_a) \le \log n\} . \qquad (4)$$

2. $L_C, L_R := L_N$ and $L_N := \emptyset$.
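The loop of Algorithm 1 can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the index is computed through the dual form of $D_{\inf}$ (the maximization over $\nu$ used elsewhere in the paper) with a simple ternary search, and the arm distributions, the helper names `d_inf` and `dmed`, the value `r=0.1`, the seed and the horizon are all assumptions chosen for the example.

```python
import math
import random


def d_inf(samples, mu, iters=40):
    """Approximate max over nu in [0, 1/(1-mu)] of mean(log(1 - (x - mu) * nu))."""
    if sum(samples) / len(samples) >= mu:
        return 0.0                       # maximiser is nu = 0
    if mu >= 1.0:
        return math.inf                  # no distribution on (-inf, 1] has mean > 1

    def L(nu):
        total = 0.0
        for x in samples:
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:
                return -math.inf
            total += math.log(arg)
        return total / len(samples)

    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):               # ternary search; L is concave in nu
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if L(m1) < L(m2):
            lo = m1
        else:
            hi = m2
    return max(0.0, L(0.5 * (lo + hi)))


def dmed(arms, horizon, r=0.1, seed=0):
    """Run the policy for `horizon` rounds and return the pull count of each arm."""
    rng = random.Random(seed)
    K = len(arms)
    rewards = [[arm(rng)] for arm in arms]       # initialization: pull each arm once
    n = K
    current = list(range(K))                     # L_C
    while n < horizon:
        remaining = set(current)                 # L_R
        nxt = set()                              # L_N
        for i in current:                        # ascending order
            if n >= horizon:
                break
            n += 1
            rewards[i].append(arms[i](rng))
            remaining.discard(i)
            best = max(sum(xs) / len(xs) for xs in rewards)   # largest empirical mean
            for j in range(K):
                if j in remaining:
                    continue
                index = (1.0 - r) * len(rewards[j]) * d_inf(rewards[j], best)
                if index <= math.log(n):         # event J'_n(j) in (4)
                    nxt.add(j)
        current = sorted(nxt)
    return [len(xs) for xs in rewards]


# Hypothetical semi-bounded arms: reward = 1 - Exp(rate), supported on (-inf, 1].
arms = [lambda g: 1.0 - g.expovariate(10.0),     # mean 0.9
        lambda g: 1.0 - g.expovariate(0.5)]      # mean -1.0
counts = dmed(arms, horizon=200)
```

The arm with the largest empirical mean always has index zero, so the next list is never empty after a full pass and the loop keeps making progress.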



2 Preliminaries 

Let $\mathcal{A}_a$, $a\in(-\infty,1)$, be the family of probability distributions on $[a,1]$. We denote
the family of distributions on $(-\infty,1]$ by $\mathcal{A}_{-\infty}$ or simply $\mathcal{A}$. For
$x\in\mathbb{R}$ and $F\in\mathcal{A}$, the cumulative distribution is denoted by
$F(x) = F((-\infty,x])$. For the metric of $\mathcal{A}_a$ we use the Lévy distance

$$d_L(F,G) = \inf\{h>0 : F(x-h)-h \le G(x) \le F(x+h)+h \ \text{for all } x\} .$$

$\mathrm{E}_F[\cdot]$ denotes the expectation under $F\in\mathcal{A}$. When we write e.g.
$\mathrm{E}_F[u(X)]$ for a function $u:\mathbb{R}\to\mathbb{R}$, $X$ denotes a random variable with
distribution $F$. The expectation of $F$ is denoted by $\mathrm{E}(F) = \mathrm{E}_F[X]$.
We always assume that the moment generating function $\mathrm{E}_F[e^{\lambda X}]$ is finite in some
neighborhood of the origin $\lambda = 0$.

Let $T_i(n)$ be the number of times that arm $i$ has been pulled through the first $n$ rounds.
$\hat F_{i,t}$ and $\hat\mu_{i,t}$ denote the empirical distribution and the mean of arm $i$ when arm
$i$ is pulled $t$ times. $\hat F_i(n) = \hat F_{i,T_i(n)}$ and $\hat\mu_i(n) = \hat\mu_{i,T_i(n)}$ denote
the empirical distribution and the mean of arm $i$ at the $n$-th round. The largest empirical mean
after the first $n$ rounds is denoted by $\hat\mu^*(n) = \max_i \hat\mu_i(n)$.

In this paper we analyze the DMED policy proposed by Honda and Takemura (2010). It is described
as Algorithm 1 where

$$D_{\inf}(F,\mu;\mathcal{A}_a) = \inf_{G\in\mathcal{A}_a:\,\mathrm{E}(G)>\mu} D(F\|G) . \qquad (3)$$

Note that this policy is parametrized by $r\in(0,1)$ in this paper, which was fixed to $r = 0$ in
the original proposal. This parameter arises because some properties of
$D_{\inf}(F,\mu;\mathcal{A}_a)$, such as boundedness and continuity, do not hold for $a = -\infty$. For
$r > 0$ we conservatively (i.e. more often) choose seemingly suboptimal arms. As a result, the
coefficient of the logarithmic term becomes $1/(1-r)$ times the theoretical bound.

Another minor change is that $\log n$ in (4) was $\log n - \log T_j(n)$ in the original proposal. It
is described in Honda and Takemura (2010) that the term $\log T_j(n)$ is only for improvement of
simulation results and has no importance for the asymptotic analysis. In this paper we avoid this
term since it makes the constant term in the finite-time analysis much more complicated.

For the setting of a = 0, the regret of DMED is evaluated as follows. 

Proposition 1 (Honda and Takemura (2010, Theorem 4)) Let $\epsilon > 0$ be arbitrary. Under the
DMED policy with $r = 0$, it holds for all $(F_1,\dots,F_K)\in\mathcal{A}_0^K$ and suboptimal arms $i$
that

$$\mathrm{E}[T_i(n)] \le \frac{1+\epsilon}{D_{\inf}(F_i,\mu^*;\mathcal{A}_0)}\,\log n + O(1) .$$

This bound is asymptotically optimal in view of the theoretical bound (1).
Now define

$$L(\nu;F,\mu) = \mathrm{E}_F[\log(1-(X-\mu)\nu)] , \qquad
L_{\max}(F,\mu) = \max_{0\le\nu\le\frac{1}{1-\mu}} L(\nu;F,\mu) . \qquad (5)$$

Functions $L$ and $L_{\max}$ correspond to the Lagrangian function and the dual problem of
$D_{\inf}(F,\mu;\mathcal{A}_a)$, respectively.

Proposition 2 (Honda and Takemura (2010, Theorem 5)) For all $F\in\mathcal{A}_0$ and $\mu<1$ it
holds that $D_{\inf}(F,\mu;\mathcal{A}_0) = L_{\max}(F,\mu)$.
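Because $L(\nu;F,\mu)$ is concave in $\nu$, the dual value $L_{\max}(F,\mu)$ in (5) can be approximated by a one-dimensional search. The following is a minimal sketch for a finitely supported $F$, assuming a plain ternary search; the function names are hypothetical. As a sanity check it uses the standard closed form of the Kullback-Leibler divergence between Bernoulli distributions, to which $D_{\inf}$ reduces for a Bernoulli $F$.

```python
import math


def l_max(support, probs, mu, iters=200):
    """Approximate max over nu in [0, 1/(1-mu)] of E_F[log(1 - (X - mu) * nu)]
    for a discrete F on (-inf, 1] given by `support` and `probs`."""
    def L(nu):
        total = 0.0
        for x, p in zip(support, probs):
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:               # log(0); only possible at the right endpoint
                return -math.inf
            total += p * math.log(arg)
        return total

    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if L(m1) < L(m2):
            lo = m1
        else:
            hi = m2
    return L(0.5 * (lo + hi))


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


# F = Bernoulli(0.3) (mass 0.7 at 0 and 0.3 at 1), target mean mu = 0.5.
value = l_max([0.0, 1.0], [0.7, 0.3], mu=0.5)
```

The search never has to enumerate candidate distributions $G$, which is the practical benefit of the dual form.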



3 Main Results 



We now state the main results of this paper in Theorems 3 and 4. We show that the theoretical
bound does not depend on knowledge of the lower bound of the support in Theorem 3 and that the
theoretical bound is actually achievable by DMED in Theorem 4.

Theorem 3 Let $a\in[-\infty,1)$ and $F\in\mathcal{A}_a$ be arbitrary. (i)
$D_{\inf}(F,\mu;\mathcal{A}_a) = D_{\inf}(F,\mu;\mathcal{A})$. (ii) If $\mu<1$ then
$D_{\inf}(F,\mu;\mathcal{A}) = L_{\max}(F,\mu)$.

We prove this theorem in the next section. Part (i) of this theorem means that the theoretical
bound does not depend on whether we know that the support of distributions is bounded from
below by $a$ or we have to consider the possibility that the support of distributions may not be
lower-bounded. Furthermore, from (ii), we can express the theoretical bound in the same expression
as for $\mathcal{A}_0$ for any distribution in $\mathcal{A}$. In view of this theorem we sometimes write
$D_{\inf}(F,\mu)$ instead of the more precise $D_{\inf}(F,\mu;\mathcal{A}_a)$ or
$D_{\inf}(F,\mu;\mathcal{A})$.

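Let $I_{\mathrm{opt}} = \{i : \mu_i = \mu^*\} \subset \{1,\dots,K\}$ be the set of optimal arms and
$\mu' = \max_{i\notin I_{\mathrm{opt}}} \mu_i$ be the second optimal expected value. Define the
Fenchel-Legendre transform of the moment generating function of $F_k$ as

$$\Lambda^*_k(x) = \sup_{\lambda}\{\lambda x - \log \mathrm{E}_{F_k}[e^{\lambda X}]\} . \qquad (6)$$

Then $\mathrm{E}[T_i(n)]$ is bounded for
$\xi_{i,\epsilon,\delta} = \epsilon D_{\inf}(F_i,\mu^*) - \delta/(1-\mu^*)$ as follows.

Theorem 4 Assume that $\mu^* < 1$. Let $\epsilon > 0$ and $i\notin I_{\mathrm{opt}}$ be arbitrary and
fix any $\delta\in(0,\mu^*-\mu')$ such that $\xi_{i,\epsilon,\delta} > 0$. Then for all $n > 0$

$$\mathrm{E}[T_i(n)] \le \frac{1+\epsilon}{1-r}\,\frac{\log n}{D_{\inf}(F_i,\mu^*)} + C ,$$

where, for $\Lambda^*(\cdot,\cdot,\cdot)$ defined in (13), the constant term is given by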

1 \ ^ K x ^ K 



C = 



1 _ r -A'(€«.«. 4 ,Mi,/i*) ^ i _ C - A * ('*"'-*) ^ l-p K [ i''—' ] 
fceXopt fc^i p t 

. f 2(1 + if) 2e 1 
f mm < H 77 > . 



r(l 

We prove this theorem in Appendix B. The proof is largely the same as that of Honda and Takemura
(2010, Theorem 4), with the difference that asymptotic large deviation probabilities are replaced with
the non-asymptotic forms in Theorems 11 and 12.

As described in Prop. 14 (iii) of Appendix A, $\Lambda^*_k(\cdot)$ corresponds to the exponential rate,
in the sample size, of the probability that the empirical mean of arm $k$ deviates from its
expectation. We can bound this rate in an explicit form for some cases. For example, it can be bounded
by the variance for the case that the support of $F_k$ is bounded from below (Hoeffding, 1963,
Theorem 1). However, it seems to be impossible to bound the rate by its finite-degree moments for
optimal arms $k\in I_{\mathrm{opt}}$ in the general case, although it is possible for suboptimal arms
$k\notin I_{\mathrm{opt}}$ (Hoeffding, 1963, Theorem 3).

Remark 5 The derived bound is somewhat weaker than that for the bounded support model in
Prop. 1 since the bound in this theorem contains the coefficient $1/(1-r)$ in the logarithmic term.
We can remove the effect of the parameter $r$ from the logarithmic term by letting $r$ depend on
$T_i(n)$, e.g., $r = 1/\sqrt{T_i(n)}$. However, it makes the analysis longer and we omit the evaluation
of this version for lack of space.

4 Properties of $D_{\inf}$ in the Semi-bounded Support Model

In the analysis of DMED it is essential to investigate the function $D_{\inf}(F,\mu;\mathcal{A})$. In
this section we extend some results on $D_{\inf}(F,\mu;\mathcal{A}_0)$ in Honda and Takemura (2010) to
our model $\mathcal{A} = \mathcal{A}_{-\infty}$ and prove Theorem 3.

First we consider the function $L(\nu;F,\mu) = \mathrm{E}_F[\log(1-(X-\mu)\nu)]$. The integrand
$l(x,\nu) = \log(1-(x-\mu)\nu)$ is differentiable in $\nu\in(0,(1-\mu)^{-1})$ for all
$x\in(-\infty,1]$ with

$$\frac{\partial l(x,\nu)}{\partial\nu} = -\frac{x-\mu}{1-(x-\mu)\nu} , \qquad
\frac{\partial^2 l(x,\nu)}{\partial\nu^2} = -\frac{(x-\mu)^2}{(1-(x-\mu)\nu)^2} .$$

Since they are bounded in $x\in(-\infty,1]$, the integral $L(\nu;F,\mu)$ is differentiable in $\nu$ with

$$L'(\nu;F,\mu) = \frac{\partial L(\nu;F,\mu)}{\partial\nu}
= -\mathrm{E}_F\left[\frac{X-\mu}{1-(X-\mu)\nu}\right] ,$$

$$L''(\nu;F,\mu) = \frac{\partial^2 L(\nu;F,\mu)}{\partial\nu^2}
= -\mathrm{E}_F\left[\frac{(X-\mu)^2}{(1-(X-\mu)\nu)^2}\right] .$$

From these derivatives the optimal solution
$\nu^*(F,\mu) = \mathrm{argmax}_{0\le\nu\le(1-\mu)^{-1}} L(\nu;F,\mu)$ of (5) exists
uniquely and satisfies the following lemma.

Lemma 6 Assume that $\mathrm{E}(F) < \mu < 1$ holds. If $\mathrm{E}_F[(1-\mu)/(1-X)] \le 1$ then
$\nu^*(F,\mu) = (1-\mu)^{-1}$ and therefore $\mathrm{E}_F[1/(1-(X-\mu)\nu^*)] \le 1$. Otherwise,
$L'(\nu^*;F,\mu) = 0$ and $\mathrm{E}_F[1/(1-(X-\mu)\nu^*)] = 1$.
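The two cases of Lemma 6 can be observed numerically on small discrete examples. The sketch below is only an illustration under assumed inputs: the helper names, the ternary search and the two toy distributions are hypothetical choices, one giving an interior maximizer and one giving the boundary maximizer $\nu^* = (1-\mu)^{-1}$.

```python
import math


def nu_star(support, probs, mu, iters=200):
    """Approximate argmax over nu in [0, 1/(1-mu)] of E_F[log(1 - (X - mu) * nu)]."""
    def L(nu):
        total = 0.0
        for x, p in zip(support, probs):
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:
                return -math.inf
            total += p * math.log(arg)
        return total

    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):               # ternary search on a concave function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if L(m1) < L(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def inverse_moment(support, probs, mu, nu):
    """E_F[1 / (1 - (X - mu) * nu)]."""
    return sum(p / (1.0 - (x - mu) * nu) for x, p in zip(support, probs))


# Interior case: E_F[(1-mu)/(1-X)] = 0.2 + 2.0 > 1 for X in {0, 0.9}, mu = 0.6.
nu_int = nu_star([0.0, 0.9], [0.5, 0.5], 0.6)
moment_int = inverse_moment([0.0, 0.9], [0.5, 0.5], 0.6, nu_int)

# Boundary case: point mass at 0 with mu = 0.5 gives E_F[(1-mu)/(1-X)] = 0.5 <= 1.
nu_bd = nu_star([0.0], [1.0], 0.5)
moment_bd = inverse_moment([0.0], [1.0], 0.5, nu_bd)
```

In the first example the stationarity condition forces the inverse moment to equal one, while in the second the maximizer sits at the right endpoint and the inverse moment is strictly below one.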

The differentiability of $L_{\max}(F,\mu)$ in $\mu$ also holds as in the case of bounded support.

Lemma 7 For $\mu > \mathrm{E}(F)$, $D_{\inf}(F,\mu)$ is differentiable with

$$\frac{\partial D_{\inf}(F,\mu)}{\partial\mu} = \nu^*(F,\mu) \le \frac{1}{1-\mu} .$$

We omit the proofs of Lemmas 6 and 7 since they are the same as Theorems 3 and 5 of Honda and
Takemura (2011) where the assumption on the support is not exploited.
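As a worked illustration of the dual form and of Lemmas 6 and 7 (an example added here, not taken from the cited proofs), consider a Bernoulli distribution $F$ with $P[X=1]=p$ and $P[X=0]=1-p$, and a level $\mu\in(p,1)$. The symbol $p$ is introduced only for this example.

```latex
\begin{align*}
L(\nu;F,\mu) &= p\log\bigl(1-(1-\mu)\nu\bigr) + (1-p)\log(1+\mu\nu), \\
L'(\nu;F,\mu) = 0 \;&\Longleftrightarrow\; \nu^* = \frac{\mu-p}{\mu(1-\mu)} \le \frac{1}{1-\mu}, \\
L_{\max}(F,\mu) &= p\log\frac{p}{\mu} + (1-p)\log\frac{1-p}{1-\mu}, \\
\frac{\partial}{\partial\mu} L_{\max}(F,\mu) &= -\frac{p}{\mu} + \frac{1-p}{1-\mu}
  = \frac{\mu-p}{\mu(1-\mu)} = \nu^* .
\end{align*}
```

The third line is the Kullback-Leibler divergence between Bernoulli distributions with means $p$ and $\mu$, and the last line agrees with the derivative formula of Lemma 7.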

Define $F_{(a)}\in\mathcal{A}_a$ as the distribution obtained by transferring the probability of
$(-\infty,a)$ under $F$ to $x = a$, that is,

$$F_{(a)}(x) = \begin{cases} 0 , & x < a , \\ F(x) , & x \ge a . \end{cases}$$

Now we give the key to the extension to the semi-bounded support in the following lemma, which shows
that the effect of the tail weight is bounded uniformly if the expectation is bounded from below.

Lemma 8 Fix arbitrary $\mu,\tilde\mu < 1$ and $\epsilon > 0$. Then there exists $a(\epsilon)$ such that
$|L_{\max}(F_{(a)},\mu) - L_{\max}(F,\mu)| \le \epsilon$ for all $a \le a(\epsilon)$ and
$F\in\mathcal{A}$ such that $\mathrm{E}(F) \ge \tilde\mu$.

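Proof: Take sufficiently small $a < \min\{0,\tilde\mu\}$ and define $A = (-\infty,a)$, $B = [a,1]$. Note
that $F(A) + F(B) = 1$. First we have

$$F(A) \le \frac{1-\tilde\mu}{1-a} , \qquad (7)$$

$$\int_A x\,\mathrm{d}F(x) \ge \tilde\mu - 1 + F(A) \qquad (8)$$

from

$$\mathrm{E}(F) \le aF(A) + 1\cdot F(B) = 1-(1-a)F(A) , \qquad
\mathrm{E}(F) \le \int_A x\,\mathrm{d}F(x) + 1\cdot F(B) ,$$

respectively. Next, $L_{\max}(F,\mu)$ can be written as

$$L_{\max}(F,\mu) = \max_{0\le\nu\le\frac{1}{1-\mu}} \mathrm{E}_F[\log(1-(X-\mu)\nu)]
= \max_{0\le\nu\le\frac{1}{1-\mu}} \left( \int_A \log\frac{1-(x-\mu)\nu}{1-(a-\mu)\nu}\,\mathrm{d}F(x)
+ \int \log(1-(x-\mu)\nu)\,\mathrm{d}F_{(a)}(x) \right) . \qquad (9)$$

Since $(1-(x-\mu)\nu)/(1-(a-\mu)\nu)$ is increasing in $\nu$ for $x < a$, substituting $0$ and
$(1-\mu)^{-1}$ into $\nu$, we can bound the first term as

$$0 \le \int_A \log\frac{1-(x-\mu)\nu}{1-(a-\mu)\nu}\,\mathrm{d}F(x)
\le \int_A \log\frac{1-x}{1-a}\,\mathrm{d}F(x)$$

$$\le F(A)\int_A \log(1-x)\,\frac{\mathrm{d}F(x)}{F(A)} \qquad (\text{by } a<0)$$

$$\le F(A)\log\left(\int_A (1-x)\,\frac{\mathrm{d}F(x)}{F(A)}\right) \qquad (\text{Jensen's inequality})$$

$$\le F(A)\log\frac{1-\tilde\mu}{F(A)} . \qquad (\text{by (8)})$$

From $\lim_{x\to+0} x\log x = 0$ and (7), the first term of (9) converges to $0$ as $a\to-\infty$. The
second term of (9) equals $L_{\max}(F_{(a)},\mu)$ and the proof is completed. ■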
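The bound obtained in this proof can be checked on a toy example: truncating a small amount of mass far in the lower tail changes $L_{\max}$ by at most $F(A)\log((1-\tilde\mu)/F(A))$. The sketch below is an illustration under assumptions: the distribution, the level $\mu=0.8$, the truncation point $a=-10$ and the helper names are hypothetical, and $\tilde\mu$ is taken to be $\mathrm{E}(F)$ itself.

```python
import math


def l_max(support, probs, mu, iters=200):
    """Approximate max over nu in [0, 1/(1-mu)] of E_F[log(1 - (X - mu) * nu)]."""
    def L(nu):
        total = 0.0
        for x, p in zip(support, probs):
            arg = 1.0 - (x - mu) * nu
            if arg <= 0.0:
                return -math.inf
            total += p * math.log(arg)
        return total

    lo, hi = 0.0, 1.0 / (1.0 - mu)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if L(m1) < L(m2):
            lo = m1
        else:
            hi = m2
    return L(0.5 * (lo + hi))


def truncate(support, probs, a):
    """F_(a): move all probability mass on (-inf, a) to the point a."""
    mass_below = sum(p for x, p in zip(support, probs) if x < a)
    kept = [(x, p) for x, p in zip(support, probs) if x >= a]
    if mass_below > 0.0:
        kept.append((a, mass_below))
    return [x for x, _ in kept], [p for _, p in kept]


support, probs = [-100.0, 0.5], [0.001, 0.999]   # light mass far in the lower tail
mu, a = 0.8, -10.0
mean = sum(x * p for x, p in zip(support, probs))   # plays the role of mu-tilde
mass_A = sum(p for x, p in zip(support, probs) if x < a)

t_support, t_probs = truncate(support, probs, a)
gap = l_max(support, probs, mu) - l_max(t_support, t_probs, mu)
bound = mass_A * math.log((1.0 - mean) / mass_A)
```

The gap is nonnegative because the first term of (9) is nonnegative for every $\nu$, and it stays below the bound from the last line of the displayed chain.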

Now we show Theorem 3 based on the preceding lemmas.

Proof of Theorem 3: (i) The proof is straightforward since $D(F\|G) \ge D(F\|G_{(a)})$ always holds
for $F\in\mathcal{A}_a$.

(ii) First we consider the case that $F$ has a bounded support, i.e. $F\in\mathcal{A}_a$ for some
$a\in(-\infty,1)$. It is easily checked that $L_{\max}(F,\mu)$ defined in (5) is invariant under the
scale transformation $[0,1]\to[a,1] : x\mapsto a+(1-a)x$. Further, $D_{\inf}(F,\mu;\mathcal{A}_a)$
defined in (3) is also invariant with respect to scale from the invariance of the divergence. Since
$D_{\inf}(F,\mu;\mathcal{A}_a) = L_{\max}(F,\mu)$ holds for $a = 0$ from Prop. 2, it also holds for all
finite $a < 1$.
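The invariance of $L_{\max}$ used here can be written out in one line. In the following sketch the primed symbols are notation introduced only for this check: $x' = a+(1-a)x$, $\mu' = a+(1-a)\mu$, $\nu' = \nu/(1-a)$, and $F'$ denotes the distribution of $a+(1-a)X$ when $X$ follows $F$.

```latex
\begin{align*}
1-(x'-\mu')\nu' &= 1-(1-a)(x-\mu)\,\frac{\nu}{1-a} = 1-(x-\mu)\nu, \\
0\le\nu'\le\frac{1}{1-\mu'} = \frac{1}{(1-a)(1-\mu)}
  \;&\Longleftrightarrow\; 0\le\nu\le\frac{1}{1-\mu}, \\
\text{hence}\quad L_{\max}(F',\mu') &= L_{\max}(F,\mu).
\end{align*}
```

So the level $\mu$ is rescaled together with the support, and the maximization range in (5) is mapped onto itself.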

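Next we consider the case that the support of $F$ is not bounded from below. We show
$D_{\inf}(F,\mu;\mathcal{A}) \le L_{\max}(F,\mu)$ and $D_{\inf}(F,\mu;\mathcal{A}) \ge L_{\max}(F,\mu)$
separately. We omit the proof of the former part for lack of space, but it can be proved by a procedure
similar to the proof of Honda and Takemura (2010, Theorem 8).

Now we consider the latter inequality. Take arbitrary $\epsilon > 0$ and let $a < \mu$ be sufficiently
small. Partitioning $(-\infty,1]$ into $A = (-\infty,a)$ and $B = [a,1]$ we can bound
$D_{\inf}(F,\mu;\mathcal{A})$ as

$$\inf_{G\in\mathcal{A}:\,\mathrm{E}(G)>\mu} D(F\|G)
\ge \inf_{G\in\mathcal{A}:\,\mathrm{E}(G)>\mu} D(F_{(a)}\|G_{(a)})$$

$$\ge \inf_{G\in\mathcal{A}_a:\,\mathrm{E}(G)>\mu} D(F_{(a)}\|G)
\qquad (\text{by } \mathrm{E}(G)\le\mathrm{E}(G_{(a)}))$$

$$\ge L_{\max}(F,\mu) - \epsilon , \qquad (\text{by Lemma 8})$$

where the last step also uses $D_{\inf}(F_{(a)},\mu;\mathcal{A}_a) = L_{\max}(F_{(a)},\mu)$ from the
bounded support case, and we complete the proof by letting $\epsilon\downarrow 0$. ■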

Finally we consider the continuity of $D_{\inf}(F,\mu;\mathcal{A})$ in $F$.

Lemma 9 If $a < 1$ is finite then $D_{\inf}(F,\mu;\mathcal{A}_a)$ is continuous in $F\in\mathcal{A}_a$.

This lemma is proved for the case $a = 0$ in Honda and Takemura (2010, Theorem 7). The extension
to general bounded supports is straightforward from the scale transformation.

For the case of semi-bounded support distributions, the continuity does not hold any more.
However, we can show the continuity over distributions with expectations bounded from below.
Here recall that in view of Theorem 3 we write $D_{\inf}(F,\mu)$ instead of
$D_{\inf}(F,\mu;\mathcal{A}_{-\infty}) = L_{\max}(F,\mu)$ when no confusion arises.

Lemma 10 Let $\epsilon > 0$ and $\mu,\tilde\mu < 1$ be arbitrary. There exists $\delta > 0$ such that

$$|D_{\inf}(G,\mu) - D_{\inf}(F,\mu)| \le \epsilon \qquad (10)$$

for all $G\in\mathcal{A}$ such that $\mathrm{E}(G) \ge \tilde\mu$ and $d_L(F,G) \le \delta$.

Proof: Applying Lemma 8 twice to $F$ and $G$, there exists $a(\epsilon)$ such that

$$|D_{\inf}(G,\mu) - D_{\inf}(F,\mu)| \le |D_{\inf}(G_{(a)},\mu) - D_{\inf}(F_{(a)},\mu)| + \epsilon/2 \qquad (11)$$

for all $a \le a(\epsilon)$ and $G$ such that $\mathrm{E}(G) \ge \tilde\mu$. From the continuity of
$D_{\inf}(\cdot,\mu)$ for bounded distributions in Lemma 9, there exists $\delta(\epsilon,F_{(a)})$ such
that

$$|D_{\inf}(G_{(a)},\mu) - D_{\inf}(F_{(a)},\mu)| \le \epsilon/2 \qquad (12)$$

for all $G_{(a)}$ such that $d_L(G_{(a)},F_{(a)}) \le \delta(\epsilon,F_{(a)})$. Note that
$d_L(G_{(a)},F_{(a)}) \le d_L(G,F)$ obviously holds from the definition of the Lévy distance.
Therefore, from (11) and (12), we obtain (10) for all $G\in\mathcal{A}$ such that
$\mathrm{E}(G) \ge \tilde\mu$ and $d_L(F,G) \le \delta(\epsilon,F_{(a(\epsilon))})$. ■



5 Large Deviation Probabilities for $D_{\inf}$

In this section we consider the behavior of $D_{\inf}(\hat F_t,\mu)$ where $\hat F_t$ is the empirical
distribution of $t$ samples from distribution $F$, which approaches $D_{\inf}(F,\mu)$ as $t$ increases.
For our case of semi-bounded support, it is sometimes convenient to consider the joint distribution of
the empirical mean $\hat\mu_t = \mathrm{E}(\hat F_t)$ and the distribution $\hat F_t$, since the
convergence of the empirical distribution does not imply that of the empirical mean.

Note that, in this section and Appendix A, we sometimes consider moment generating functions
and their Fenchel-Legendre transforms of random variables on domains other than $\mathbb{R}$. Since the
underlying distribution is obvious from the context, we write e.g. $\Lambda^*_{\mathbb{R}^2}$ to clarify
the domain, whereas the subscript was used to indicate the arm such as $\Lambda^*_k$ in previous
sections.



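Theorem 11 If $\mu < \mathrm{E}(F)$ and $u > D_{\inf}(F,\mu)$ then

$$P_F[D_{\inf}(\hat F_t,\mu) \ge u \,\cap\, \hat\mu_t \le \mu] \le
\begin{cases} 2e^{-t\Lambda^*_{\mathbb{R}}(\mu)} , & u \le \Lambda^*_{\mathbb{R}}(\mu) , \\
2e(1+t)e^{-tu} , & \text{otherwise,} \end{cases}$$

where $\Lambda^*_{\mathbb{R}}(x) = \sup_{\lambda\in\mathbb{R}}\{\lambda x - \log\mathrm{E}_F[e^{\lambda X}]\}$.

Theorem 12 Fix arbitrary $\mu > \mathrm{E}(F)$ and $v > 0$. Then it holds for $c_0 \ge 2.163$ that

$$P_F[D_{\inf}(\hat F_t,\mu) \le D_{\inf}(F,\mu) - v] \le e^{-t\Lambda^*(v,\mathrm{E}(F),\mu)} ,$$

where

$$\Lambda^*(v,\mathrm{E}(F),\mu) =
\begin{cases} \dfrac{v^2}{2\left(c_0 + \frac{1-\mathrm{E}(F)}{1-\mu}\right)} ,
 & v \le \dfrac{1}{2}\left(c_0 + \dfrac{1-\mathrm{E}(F)}{1-\mu}\right) , \\[2ex]
\dfrac{v}{2} - \dfrac{1}{8}\left(c_0 + \dfrac{1-\mathrm{E}(F)}{1-\mu}\right) , & \text{otherwise.}
\end{cases} \qquad (13)$$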
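Both bounds are explicit once the rate $\Lambda^*_{\mathbb{R}}(\mu)$ and the mean $\mathrm{E}(F)$ are known, so they are straightforward to evaluate. The following is a minimal sketch; the function names are hypothetical, the rate of Theorem 11 is passed in as a plain number (`rate_mu`) rather than computed from a distribution, and the numerical arguments are arbitrary.

```python
import math


def theorem11_bound(t, u, rate_mu):
    """Upper bound of Theorem 11, with rate_mu standing for Lambda*_R(mu)."""
    if u <= rate_mu:
        return 2.0 * math.exp(-t * rate_mu)
    return 2.0 * math.e * (1 + t) * math.exp(-t * u)


def theorem12_rate(v, mean, mu, c0=2.163):
    """Exponent Lambda*(v, E(F), mu) of (13); `mean` stands for E(F)."""
    c = c0 + (1.0 - mean) / (1.0 - mu)
    if v <= c / 2.0:
        return v * v / (2.0 * c)
    return v / 2.0 - c / 8.0


def theorem12_bound(t, v, mean, mu, c0=2.163):
    """Upper bound of Theorem 12 on the lower deviation probability."""
    return math.exp(-t * theorem12_rate(v, mean, mu, c0))


c_example = 2.163 + (1.0 - 0.2) / (1.0 - 0.6)     # constant for E(F)=0.2, mu=0.6
rate_at_switch = theorem12_rate(c_example / 2.0, 0.2, 0.6)
```

The two branches of (13) agree at the switching point, so the exponent is continuous in $v$: quadratic for small deviations and linear for large ones.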

We prove these theorems using Prop. 14, Theorem 15 and Prop. 16 in Appendix A. Before proving
Theorem 11 we show its asymptotic version in the following.

Lemma 13 If $\mu < \mathrm{E}(F)$ and $u > D_{\inf}(F,\mu)$ then

$$\limsup_{t\to\infty} \frac{1}{t}\log P_F[D_{\inf}(\hat F_t,\mu) \ge u \,\cap\, \hat\mu_t \le \mu]
\le -\max\{u,\Lambda^*_{\mathbb{R}}(\mu)\} .$$

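Proof: Define $C = \{(G,\mathrm{E}(G)) : G\in\mathcal{A},\ D_{\inf}(G,\mu) \ge u \,\cap\, \mathrm{E}(G) \le \mu\}
\subset \mathcal{A}\times\mathbb{R}$ and let $\bar C$ be its closure. First we show that
$D_{\inf}(G,\mu) \ge u$ and $v \le \mu$ for all $(G,v)\in\bar C$.

From the definition of closure, there exists a sequence $\{(G_l,\mathrm{E}(G_l))\in C\}_l$ such that
$(G_l,\mathrm{E}(G_l)) \to (G,v)$, i.e., $G_l\to G$ and $\mathrm{E}(G_l)\to v$. Thus
$\mathrm{E}(G_l) \ge v-\epsilon$ holds for all sufficiently large $l$ where $\epsilon > 0$ is arbitrary.
Therefore, from Lemma 10 we obtain

$$D_{\inf}(G,\mu) = \lim_{l\to\infty} D_{\inf}(G_l,\mu) \ge \liminf_{l\to\infty} u = u .$$

The inequality $v \le \mu$ is obvious from $\mathrm{E}(G_l)\to v$ and $\mathrm{E}(G_l) \le \mu$.
Now we obtain from Theorem 15 that

$$\limsup_{t\to\infty} \frac{1}{t}\log P_F[D_{\inf}(\hat F_t,\mu) \ge u \,\cap\, \hat\mu_t \le \mu]$$

$$\le \limsup_{t\to\infty} \frac{1}{t}\log P_F[(\hat F_t,\hat\mu_t)\in\bar C]$$

$$\le -\inf_{(G,v):\,D_{\inf}(G,\mu)\ge u\,\cap\,v\le\mu}\ \sup_{(\phi,\lambda)\in C_b(\mathbb{R})\times\mathbb{R}}
\left\{ \int \phi(x)\,\mathrm{d}G(x) + \lambda v - \log\int e^{\phi(x)+\lambda x}\,\mathrm{d}F(x) \right\}$$

$$\le -\inf_{(G,v):\,D_{\inf}(G,\mu)\ge u\,\cap\,v\le\mu} \max\{\Lambda^*_{\mathbb{R}}(v),\,D(G\|F)\} \qquad (14)$$

$$\le -\inf_{(G,v):\,D_{\inf}(G,\mu)\ge u\,\cap\,v\le\mu} \max\{\Lambda^*_{\mathbb{R}}(v),\,D_{\inf}(G,\mu)\}
\qquad (\text{by } \mu < \mathrm{E}(F))$$

$$\le -\max\{\Lambda^*_{\mathbb{R}}(\mu),\,u\} , \qquad
(\Lambda^*_{\mathbb{R}}(v) \text{ is decreasing in } v \le \mu < \mathrm{E}(F))$$

where (14) follows from
$(\{0\}\times\mathbb{R}) \cup (C_b(\mathbb{R})\times\{0\}) \subset C_b(\mathbb{R})\times\mathbb{R}$ and
Prop. 16. ■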

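Proof of Theorem 11: Let $\delta > 0$ be arbitrary and define $\nu_i = 1/(2(1-\mu)) + i\delta$ for
$i = -M_\delta, -M_\delta+1, \dots, M_\delta-1, M_\delta$, where
$M_\delta = \lfloor 1/(2(1-\mu)\delta) \rfloor$. Further define $\nu_{-M_\delta-1} = 0$ and
$\nu_{M_\delta+1} = 1/(1-\mu)$. Then $\{[\nu_i,\nu_{i+1}]\}$ partitions $[0,(1-\mu)^{-1}]$ into intervals
with length not larger than $\delta$. Therefore the event $\{D_{\inf}(\hat F_t,\mu) \ge u\}$ can be
expressed as

$$\{D_{\inf}(\hat F_t,\mu) \ge u\}
= \left\{\exists\nu\in\left[0,\tfrac{1}{1-\mu}\right],\ L(\nu;\hat F_t,\mu) \ge u\right\}$$

$$= \bigcup_{i=-M_\delta-1}^{-1} \{\exists\nu\in[\nu_i,\nu_{i+1}],\ L(\nu;\hat F_t,\mu) \ge u\}
\ \cup\ \bigcup_{i=1}^{M_\delta+1} \{\exists\nu\in[\nu_{i-1},\nu_i],\ L(\nu;\hat F_t,\mu) \ge u\} . \qquad (15)$$

Since $\nu_{i+1}-\nu_i \le \delta$ and $L(\nu;\hat F_t,\mu)$ is concave in $\nu$, it holds for
$i \le -1$ that

$$\{\exists\nu\in[\nu_i,\nu_{i+1}],\ L(\nu;\hat F_t,\mu) \ge u\}
\subset \{L(\nu_{i+1};\hat F_t,\mu) - \delta\min\{0,L'(\nu_{i+1};\hat F_t,\mu)\} \ge u\}$$

$$\subset \{L(\nu_{i+1};\hat F_t,\mu) - \delta\min\{0,L'(\nu_0;\hat F_t,\mu)\} \ge u\} . \qquad (16)$$

Similarly it holds for $i \ge 1$ that

$$\{\exists\nu\in[\nu_{i-1},\nu_i],\ L(\nu;\hat F_t,\mu) \ge u\}
\subset \{L(\nu_{i-1};\hat F_t,\mu) + \delta\max\{0,L'(\nu_0;\hat F_t,\mu)\} \ge u\} . \qquad (17)$$

Here the derivative is written as

$$L'(\nu;\hat F_t,\mu) = \frac{1}{\nu} - \frac{1}{\nu}\,\mathrm{E}_{\hat F_t}\left[\frac{1}{1-(X-\mu)\nu}\right] .$$

Since $1/(1-(x-\mu)\nu)$ is positive and increasing in $x \le 1$, it is bounded as

$$\frac{1}{\nu} \ge L'(\nu;\hat F_t,\mu) \ge \frac{1}{\nu} - \frac{1}{\nu}\cdot\frac{1}{1-(1-\mu)\nu}
= -\frac{1-\mu}{1-(1-\mu)\nu} .$$

Thus $L'(\nu_0;\hat F_t,\mu) = L'(1/(2(1-\mu));\hat F_t,\mu)$ is bounded as

$$2(1-\mu) \ge L'(\nu_0;\hat F_t,\mu) \ge -2(1-\mu) .$$

Combining this inequality with (15), (16) and (17) we obtain

$$P_F[D_{\inf}(\hat F_t,\mu) \ge u \,\cap\, \hat\mu_t \le \mu]
\le \sum_{-M_\delta-1\le i\le M_\delta+1,\ i\ne 0}
P_F[L(\nu_i;\hat F_t,\mu) \ge u-2(1-\mu)\delta \,\cap\, \hat\mu_t \le \mu] . \qquad (18)$$

Now regard $Y = (Y^{(1)},Y^{(2)}) = (\log(1-(X-\mu)\nu_i),\,X)$ as a random variable on $\mathbb{R}^2$.
Define a closed set $C = [u-2(1-\mu)\delta,\infty)\times(-\infty,\mu] \subset \mathbb{R}^2$ and its
$\alpha$-blowup $C_\alpha = (u-2(1-\mu)\delta-\alpha,\infty)\times(-\infty,\mu+\alpha)$ for
$\alpha > 0$. Then the event $\{L(\nu_i;\hat F_t,\mu) \ge u-2(1-\mu)\delta \,\cap\, \hat\mu_t \le \mu\}$ is
equivalent to the event that the empirical mean of $Y$ is contained in the closed convex set $C$. Thus
we obtain from Prop. 14 (i) that

$$P_F[L(\nu_i;\hat F_t,\mu) \ge u-2(1-\mu)\delta \,\cap\, \hat\mu_t \le \mu]
\le \exp\left(-t\inf_{y\in C}\Lambda^*_{\mathbb{R}^2}(y)\right) , \qquad (19)$$

where $\Lambda^*_{\mathbb{R}^2}(y)$ is defined by (27). Since $C_\alpha \supset C$ is open, the
exponential rate is bounded as

$$-\inf_{y\in C}\Lambda^*_{\mathbb{R}^2}(y) \le -\inf_{y\in C_\alpha}\Lambda^*_{\mathbb{R}^2}(y)$$

$$\le \liminf_{t\to\infty}\frac{1}{t}\log
P_F[L(\nu_i;\hat F_t,\mu) > u-2(1-\mu)\delta-\alpha \,\cap\, \hat\mu_t < \mu+\alpha]
\qquad (\text{by Prop. 14 (ii)})$$

$$\le \limsup_{t\to\infty}\frac{1}{t}\log
P_F[D_{\inf}(\hat F_t,\mu) \ge u-2(1-\mu)\delta-\alpha \,\cap\, \hat\mu_t \le \mu+\alpha]$$

$$\le -\max\{u-2(1-\mu)\delta-\alpha,\ \Lambda^*_{\mathbb{R}}(\mu+\alpha)\} . \qquad (\text{by Lemma 13})$$

Letting $\alpha\downarrow 0$ we obtain

$$-\inf_{y\in C}\Lambda^*_{\mathbb{R}^2}(y) \le -\max\{u-2(1-\mu)\delta,\ \Lambda^*_{\mathbb{R}}(\mu)\} . \qquad (20)$$

Finally we obtain from (18), (19) and (20) that

$$P_F[D_{\inf}(\hat F_t,\mu) \ge u \,\cap\, \hat\mu_t \le \mu]
\le 2\left(1+\frac{1}{2(1-\mu)\delta}\right)
\exp\left(-t\max\{u-2(1-\mu)\delta,\ \Lambda^*_{\mathbb{R}}(\mu)\}\right) ,$$

and we complete the proof by letting $\delta\to\infty$ for $u \le \Lambda^*_{\mathbb{R}}(\mu)$ and
$\delta = 1/(2t(1-\mu))$ for $u > \Lambda^*_{\mathbb{R}}(\mu)$. ■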
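The two choices of $\delta$ can be verified by direct substitution into the final bound $2\bigl(1+\tfrac{1}{2(1-\mu)\delta}\bigr)\exp\bigl(-t\max\{u-2(1-\mu)\delta,\Lambda^*_{\mathbb{R}}(\mu)\}\bigr)$. The following short computation is added as a check and uses no notation beyond that of the proof.

```latex
\begin{align*}
u \le \Lambda^*_{\mathbb{R}}(\mu):\quad
  & \max\{u-2(1-\mu)\delta,\ \Lambda^*_{\mathbb{R}}(\mu)\} = \Lambda^*_{\mathbb{R}}(\mu)
    \ \text{ for every } \delta>0, \\
  & 2\Bigl(1+\frac{1}{2(1-\mu)\delta}\Bigr) \to 2 \ \ (\delta\to\infty)
    \ \Longrightarrow\ \text{bound } 2e^{-t\Lambda^*_{\mathbb{R}}(\mu)}; \\
u > \Lambda^*_{\mathbb{R}}(\mu),\ \delta=\frac{1}{2t(1-\mu)}:\quad
  & 2(1-\mu)\delta = \frac{1}{t},\qquad 2\Bigl(1+\frac{1}{2(1-\mu)\delta}\Bigr) = 2(1+t), \\
  & \exp\bigl(-t\max\{u-\tfrac{1}{t},\ \Lambda^*_{\mathbb{R}}(\mu)\}\bigr)
    \le e^{-tu+1}
    \ \Longrightarrow\ \text{bound } 2e(1+t)e^{-tu}.
\end{align*}
```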
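Proof of Theorem 12: Let $u = D_{\inf}(F,\mu) - v$. First we obtain for $\nu^* = \nu^*(F,\mu)$ that

$$P_F[D_{\inf}(\hat F_t,\mu) \le D_{\inf}(F,\mu) - v]
= P_F\left[\max_{0\le\nu\le\frac{1}{1-\mu}} \mathrm{E}_{\hat F_t}[\log(1-(X-\mu)\nu)] \le u\right]
\le P_F\left[\mathrm{E}_{\hat F_t}[\log(1-(X-\mu)\nu^*)] \le u\right] .$$

Define random variables $Y = 1-(X-\mu)\nu^*$ and $Z = \log Y = \log(1-(X-\mu)\nu^*)$ where $X$ follows
the distribution $F$. Let $\bar Z_t$ be the mean of $t$ i.i.d. copies of $Z$. Then, from Prop. 14 (iii),
the above probability is bounded as

$$P_F[D_{\inf}(\hat F_t,\mu) \le D_{\inf}(F,\mu) - v] \le P_F[\bar Z_t \le u]
\le e^{-t\Lambda^*_{\mathbb{R}}(u)} , \qquad (21)$$

where $\Lambda^*_{\mathbb{R}}(u) = \sup_\lambda\{\lambda u - \log\mathrm{E}_F[e^{\lambda Z}]\}
= \sup_\lambda\{\lambda u - \log\mathrm{E}_F[Y^\lambda]\}$.

Note that $\mathrm{E}_F[e^{-1\cdot Z}] = \mathrm{E}_F[(1-(X-\mu)\nu^*)^{-1}] \le 1$ from Lemma 6 and
$\mathrm{E}_F[e^{1\cdot Z}] = \mathrm{E}_F[1-(X-\mu)\nu^*] = 1-(\mathrm{E}(F)-\mu)\nu^*$. Since they are
finite, the moment generating function $\mathrm{E}_F[e^{\lambda Z}] = \mathrm{E}_F[Y^\lambda]$ exists
for all $\lambda\in[-1,1]$ and is infinitely differentiable in $\lambda\in(-1,1)$.

Before evaluating $\Lambda^*_{\mathbb{R}}(u)$ we bound $\mathrm{E}_F[Y^\lambda]$ for
$\lambda\in[-1,1]$. For $\lambda\in[-1,0]$, we obtain from $\mathrm{E}_F[Y^{-1}] \le 1$ and the
convexity of $y^\lambda$ in $\lambda$ that

$$\mathrm{E}_F[Y^\lambda] \le \mathrm{E}_F[(-\lambda)Y^{-1} + (1+\lambda)Y^0] \le -\lambda + (1+\lambda) = 1 . \qquad (22)$$

Similarly, we obtain for $\lambda\in(0,1]$ that

$$\mathrm{E}_F[Y^\lambda] \le \mathrm{E}_F[(1-\lambda)Y^0 + \lambda Y^1]
= (1-\lambda) + \lambda(1-(\mathrm{E}(F)-\mu)\nu^*)
= 1 + \lambda(\mu-\mathrm{E}(F))\nu^*$$

$$\le 1 + \frac{\mu-\mathrm{E}(F)}{1-\mu} = \frac{1-\mathrm{E}(F)}{1-\mu} .
\qquad (\text{by } \mu > \mathrm{E}(F) \text{ and } \nu^* \le \tfrac{1}{1-\mu}) \qquad (23)$$

Define the objective function in $\Lambda^*_{\mathbb{R}}(u)$ as
$R(\lambda) = \lambda u - \log\mathrm{E}_F[Y^\lambda]$. Then, for $\lambda\in[-1,0]$,

$$R'(\lambda) = u - \frac{\mathrm{E}_F[Y^\lambda\log Y]}{\mathrm{E}_F[Y^\lambda]}
\le u - \mathrm{E}_F[Y^\lambda\log Y] . \qquad (\text{by (22)}) \qquad (24)$$

We bound $R(\lambda)$ from below for $\lambda\in[-1/2,0]$ in the following. For the second term of the
right-hand side of (24), it holds for $\lambda\in[-1/2,0]$ that

$$\mathrm{E}_F[Y^\lambda\log Y] \ge \mathrm{E}_F[Y^0\log Y]
- \int_\lambda^0 \max_{\tilde\lambda\in[-\frac12,0]}
\left\{\frac{\mathrm{d}\,\mathrm{E}_F[Y^{\tilde\lambda}\log Y]}{\mathrm{d}\tilde\lambda}\right\}\mathrm{d}\lambda
= D_{\inf}(F,\mu) + \lambda\max_{\tilde\lambda\in[-\frac12,0]} \mathrm{E}_F[Y^{\tilde\lambda}(\log Y)^2] . \qquad (25)$$

Note that $(\log y)^2$ is smaller than $y^{-1/2}$ for $y\to+0$ and smaller than $y$ for $y\to\infty$.
Therefore there exists $c_0 > 0$ such that $(\log y)^2 \le c_0\,y^{-1/2} + y$ for all $y > 0$. In fact,
this inequality holds by letting $c_0 \ge 2.163$. Then we obtain from (22) and (23) that

$$\mathrm{E}_F[Y^\lambda(\log Y)^2] \le \mathrm{E}_F[Y^\lambda(c_0Y^{-1/2} + Y)]
\le c_0 + \frac{1-\mathrm{E}(F)}{1-\mu} . \qquad (26)$$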
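The numerical value of $c_0$ can be checked directly: $(\log y)^2 \le c_0\,y^{-1/2} + y$ for all $y>0$ is equivalent to $c_0 \ge \sup_{y>0} \sqrt{y}\,((\log y)^2 - y)$. The following is a small grid-search sketch of that supremum; the substitution $y = e^s$, the grid range and the step size are choices made for this illustration only.

```python
import math


def g(s):
    """sqrt(y) * ((log y)^2 - y) evaluated at y = exp(s)."""
    return math.exp(0.5 * s) * (s * s - math.exp(s))


# s ranges over [-30, 10] in steps of 0.001; g tends to 0 as s -> -inf
# and is negative for large s, so the supremum lies well inside this range.
grid = [-30.0 + 0.001 * k for k in range(40001)]
smallest_c0 = max(g(s) for s in grid)
```

The maximum is attained near $\log y = -4$, where $\sqrt{y}(\log y)^2 = 16e^{-2}$, and it is slightly below 2.163, consistent with the constant stated in the text.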

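Combining (24), (25) and (26) with $R(0) = 0$ we obtain

$$R'(\lambda) \le u - D_{\inf}(F,\mu) - \lambda\left(c_0 + \frac{1-\mathrm{E}(F)}{1-\mu}\right)
= -v - \lambda\left(c_0 + \frac{1-\mathrm{E}(F)}{1-\mu}\right) ,$$

$$R(\lambda) = R(0) + \int_0^\lambda R'(\tilde\lambda)\,\mathrm{d}\tilde\lambda
\ge -\lambda v - \frac{\lambda^2}{2}\left(c_0 + \frac{1-\mathrm{E}(F)}{1-\mu}\right) .$$

Finally,

$$\Lambda^*_{\mathbb{R}}(u) = \sup_\lambda R(\lambda) \ge \sup_{\lambda\in[-\frac12,0]} R(\lambda) \ge
\begin{cases} \dfrac{v^2}{2\left(c_0 + \frac{1-\mathrm{E}(F)}{1-\mu}\right)} ,
 & v \le \dfrac12\left(c_0 + \dfrac{1-\mathrm{E}(F)}{1-\mu}\right) , \\[2ex]
\dfrac{v}{2} - \dfrac18\left(c_0 + \dfrac{1-\mathrm{E}(F)}{1-\mu}\right) , & \text{otherwise,}
\end{cases}$$

and we obtain the theorem with (21). ■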
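The two cases come from maximizing the quadratic lower bound on $R(\lambda)$ over $\lambda\in[-\tfrac12,0]$. The computation is spelled out below, where $c = c_0 + \frac{1-\mathrm{E}(F)}{1-\mu}$ and $q(\lambda) = -\lambda v - \frac{c}{2}\lambda^2$ are shorthand introduced only for this check.

```latex
\begin{align*}
q'(\lambda) &= -v - c\lambda = 0 \iff \lambda = -\frac{v}{c}, \\
v \le \frac{c}{2}:\quad & -\frac{v}{c}\in\Bigl[-\frac12,0\Bigr],\qquad
   q\Bigl(-\frac{v}{c}\Bigr) = \frac{v^2}{c} - \frac{v^2}{2c} = \frac{v^2}{2c}, \\
v > \frac{c}{2}:\quad & q'(\lambda) \le -v + \frac{c}{2} < 0 \ \text{ on } \Bigl[-\frac12,0\Bigr],
   \qquad \max q = q\Bigl(-\frac12\Bigr) = \frac{v}{2} - \frac{c}{8}.
\end{align*}
```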



6 Concluding Remarks 

We proved that the theoretical bound only depends on the upper bound of the support in the 
nonparametric stochastic bandits. We refined the analysis of DMED policy to a non-asymptotic 
form for all distributions with moment generating functions in this model. 



A Large Deviation Principle and its Application to a Joint Distribution 



In this appendix we consider the large deviation principle (LDP) for the empirical mean and the
distribution based on Dembo and Zeitouni (1998) (DZ, hereafter). We first summarize results on LDP for
empirical means of finite dimensional random variables and then we derive LDP for the joint
distribution of the empirical distribution and the mean in Theorem 15.

Let $\hat S_t$ be the empirical mean of i.i.d. random variables $X_1,\dots,X_t\in\mathcal{X}$ with
distribution $F$, where $\mathcal{X}$ is a general topological vector space. For a distribution on
$\mathbb{R}$, we can regard its empirical distribution as the empirical mean of delta measures
$\delta_{X_i}\in\mathcal{A}\subset\mathcal{V}$, where $\mathcal{V}$ is the space of all finite
measures on $(-\infty,1]$. We write $\hat\mu_t$ and $\hat F_t$ instead of $\hat S_t$ for empirical means
of $X_i\in\mathbb{R}$ and $\delta_{X_i}\in\mathcal{A}$, respectively.

Define the logarithmic moment generating function and its Fenchel-Legendre transform for
distribution $F$ by

$$\Lambda_{\mathcal{X}}(\lambda) = \log\int_{\mathcal{X}} e^{\langle\lambda,u\rangle}\,\mathrm{d}F(u) ,$$

$$\Lambda^*_{\mathcal{X}}(x) = \sup_{\lambda\in\mathcal{X}^*}\{\langle\lambda,x\rangle - \Lambda_{\mathcal{X}}(\lambda)\} ,$$

where $\mathcal{X}^*$ is the space of linear continuous functions on $\mathcal{X}$. Especially, for the
case $\mathcal{X} = \mathbb{R}^d$ it is expressed for $\mathcal{X}^* = \mathbb{R}^d$ as

$$\langle\lambda,x\rangle = \sum_i \lambda_i x_i , \qquad \lambda,x\in\mathbb{R}^d . \qquad (27)$$

Similarly, for the case $\mathcal{X} = \mathcal{V}$, it is expressed for
$\mathcal{X}^* = C_b(\mathbb{R})$ as

$$\langle\phi,G\rangle = \int \phi(u)\,\mathrm{d}G(u) , \qquad \phi\in C_b(\mathbb{R}),\ G\in\mathcal{A} ,$$

where $C_b(\mathbb{R})$ is the space of bounded continuous functions on $\mathbb{R}$. Note that it is
shown in DZ that in the scope of our paper $\Lambda^*_{\mathcal{X}}(x)$ is always a rate function, that
is, a lower semicontinuous function with range $[0,\infty]$, although we omit this statement in the
following.

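Proposition 14 (DZ, Ex. 2.2.38, Theorem 2.2.30 and Lemma 2.2.5) Let $\mathcal{X} = \mathbb{R}^d$ and
assume that $\Lambda_{\mathbb{R}^d}(\lambda)$ exists around $\lambda = 0$. (i) For any convex closed
$C\subset\mathbb{R}^d$

$$\frac{1}{t}\log P_F[\hat S_t\in C] \le -\inf_{x\in C}\Lambda^*_{\mathbb{R}^d}(x) .$$

(ii) For any open $A\subset\mathbb{R}^d$

$$\liminf_{t\to\infty}\frac{1}{t}\log P_F[\hat S_t\in A] \ge -\inf_{x\in A}\Lambda^*_{\mathbb{R}^d}(x) .$$

(iii) For the case $d = 1$, $\Lambda^*_{\mathbb{R}}(x)$ is decreasing at $x < \mathrm{E}(F)$ and
increasing at $x > \mathrm{E}(F)$. Consequently,

$$\frac{1}{t}\log P_F[\hat\mu_t \le x] \le -\Lambda^*_{\mathbb{R}}(x) , \quad \text{if } x \le \mathrm{E}(F) ,$$

$$\frac{1}{t}\log P_F[\hat\mu_t \ge x] \le -\Lambda^*_{\mathbb{R}}(x) , \quad \text{if } x \ge \mathrm{E}(F) .$$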
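Part (iii) is the familiar Chernoff-type bound and can be checked on a simple case. The sketch below is a hedged illustration: it uses a Bernoulli distribution (a hypothetical choice, with the helper names also hypothetical), evaluates the Fenchel-Legendre transform by a ternary search over $\lambda$, and compares $e^{-t\Lambda^*_{\mathbb{R}}(x)}$ with the exact binomial lower tail. The closed form of the rate for a Bernoulli distribution, the Kullback-Leibler divergence between Bernoulli laws, is a standard fact used only as a cross-check.

```python
import math


def rate(x, p, lo=-30.0, hi=30.0, iters=200):
    """sup over lambda of lambda*x - log E[exp(lambda*X)] for X ~ Bernoulli(p)."""
    def objective(lam):
        return lam * x - math.log(1.0 - p + p * math.exp(lam))

    for _ in range(iters):               # ternary search; objective is concave
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))


def lower_tail(t, k, p):
    """Exact P[Binomial(t, p) <= k]."""
    return sum(math.comb(t, j) * p ** j * (1.0 - p) ** (t - j) for j in range(k + 1))


t, k, p = 50, 15, 0.5
x = k / t                                 # threshold 0.3, below the mean 0.5
chernoff = math.exp(-t * rate(x, p))      # bound from part (iii)
exact = lower_tail(t, k, p)               # P[empirical mean <= x]
closed_form = x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))
```

The bound holds for every sample size, not only asymptotically, which is what makes it usable in a finite-time analysis.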

In the well-known Sanov's theorem, LDP for the empirical distribution is considered. On the other
hand, in the proof of Theorem 11 we have to consider the joint probability that the empirical
distribution and the mean lie in a subset of $\mathcal{A}\times\mathbb{R}$. Theorem 15 below is an
extension of Sanov's theorem for this purpose. This theorem is derived from Cramér's theorem in the
same way as the derivation of Sanov's theorem.

Recall that we assume that $\mathbb{R}$ is equipped with the standard topology and $\mathcal{A}$ is
equipped with the topology induced by the Lévy metric $d_L(F,G)$ for $F,G\in\mathcal{A}$. For the space
$\mathcal{A}\times\mathbb{R}$ we use the product topology of $\mathcal{A}$ and $\mathbb{R}$, which is
equivalent to the topology induced by the metric $\max\{d_L(F,G),|x-y|\}$ for
$(F,x),(G,y)\in\mathcal{A}\times\mathbb{R}$.

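Theorem 15 Let $F$ be an arbitrary distribution on $\mathbb{R}$ such that the moment generating
function exists in some neighborhood of $\lambda = 0$. For any closed set
$C\subset\mathcal{A}\times\mathbb{R}$, it holds that

$$\limsup_{t\to\infty}\frac{1}{t}\log P_F[(\hat F_t,\hat\mu_t)\in C]
\le -\inf_{(G,x)\in C}\Lambda^*_{\mathcal{V}\times\mathbb{R}}((G,x)) , \qquad (28)$$

where

$$\Lambda^*_{\mathcal{V}\times\mathbb{R}}((G,x)) = \sup_{(\phi,\lambda)\in C_b(\mathbb{R})\times\mathbb{R}}
\left\{ \int \phi(u)\,\mathrm{d}G(u) + \lambda x - \log\int e^{\phi(u)+\lambda u}\,\mathrm{d}F(u) \right\} .$$

For the actual computation of $\Lambda^*_{\mathcal{V}\times\mathbb{R}}(\cdot)$ the following proposition
is useful.

Proposition 16 (DZ, Lemma 6.2.13) For all $F,G\in\mathcal{A}$,

$$\sup_{\phi\in C_b(\mathbb{R})} \left\{ \int \phi(u)\,\mathrm{d}G(u) - \log\int e^{\phi(u)}\,\mathrm{d}F(u) \right\}
= D(G\|F) .$$

For the rest of this section we prove Theorem 15. We start with Cramér's theorem for general
Hausdorff topological vector spaces $\mathcal{X}$ and probability measures $F$ on $\mathcal{X}$.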

Proposition 17 (DZ, Theorem 6.1.3) Assume that the following (a), (b) hold. (a) $\mathcal{X}$ is
locally convex and there exists a closed convex subset $\mathcal{E}$ of $\mathcal{X}$ such that
$P_F(\mathcal{E}) = 1$. Further, $\mathcal{E}$ can be made into a Polish space with respect to the
topology induced by $\mathcal{X}$. (b) The closed convex hull of each compact $K\subset\mathcal{E}$ is
compact. Then it holds for all compact closed sets $C$ that

$$\limsup_{t\to\infty}\frac{1}{t}\log P_F[\hat S_t\in C] \le -\inf_{x\in C}\Lambda^*_{\mathcal{X}}(x) . \qquad (29)$$

The assertion of this proposition is restricted to compact sets and is called weak LDP. We can remove
this restriction to obtain full LDP if the exponential tightness is satisfied. The laws of $\hat S_t$
are exponentially tight if, for every $\alpha < \infty$, there exists a compact set
$K_\alpha\subset\mathcal{X}$ such that

$$\limsup_{t\to\infty}\frac{1}{t}\log P_F[\hat S_t\in K_\alpha^c] < -\alpha ,$$

where superscript "c" denotes the complement of the set.

Proposition 18 (DZ, Lemma 1.2.18) If the laws of $S_t$ are exponentially tight then (29) holds 
for all closed sets $C$. 

Proposition 19 (DZ, Lemma 6.2.6 and the discussion after Eq. (2.2.33)) (i) The laws of the 
empirical distributions $\hat{F}_t \in \mathcal{A}$ are exponentially tight for all $F \in \mathcal{A}$. (ii) The laws of the 
empirical means $\hat{\mu}_t \in \mathbb{R}$ are exponentially tight if the moment generating function $\mathrm{E}_F[\mathrm{e}^{\lambda x}]$ exists in 
some neighborhood of $\lambda = 0$. 

Proof of Theorem 15. First we can obtain (28) for all compact closed $C \subset \mathcal{A} \times \mathbb{R}$ as a direct 
application of Prop. 17 with $\mathcal{X} := V \times \mathbb{R}$ and $\mathcal{E} := \mathcal{A} \times \mathbb{R}$ by the following argument. 

For the case $\mathcal{X} := V$ and $\mathcal{E} := \mathcal{A}$, it is shown in the proof of Sanov's theorem that the assumption of Prop. 17 
is satisfied when $\mathcal{A}$ is equipped with the topology induced by the Lévy metric (see DZ, Sect. 6.1). The 
essential point in the proof of Sanov's theorem is that the local convexity in the assumption is satisfied 
if a vector space $\mathcal{X}$ is equipped with a topology called the weak topology (see, e.g., Dunford and Schwartz 
(1988, Chap. V) for details of weak topologies). Since the relative topology on $\mathcal{A}$ of the weak topology 
of $V$ is equivalent to the topology induced by the Lévy metric, Prop. 17 is applicable for the case of 
Sanov's theorem. Here note that the weak topology of $V \times \mathbb{R}$ is equivalent to the product topology 
of the weak topologies of $V$ and $\mathbb{R}$. Thus it is shown in a parallel way that the assumption is also 
satisfied in our case. 

In view of Prop. 18, the proof is complete once the exponential tightness of the laws of $(\hat{F}_t, \hat{\mu}_t)$ is 
proved. From Prop. 19, for every $\alpha < \infty$ there exist compact sets $A_\alpha \subset \mathcal{A}$ and $B_\alpha \subset \mathbb{R}$ such that 

$$\limsup_{t\to\infty} \frac{1}{t} \log P_F[\hat{F}_t \in A_\alpha^c] < -\alpha, \qquad \limsup_{t\to\infty} \frac{1}{t} \log P_F[\hat{\mu}_t \in B_\alpha^c] < -\alpha. \tag{30}$$

Letting $K_\alpha := A_\alpha \times B_\alpha$ we obtain 

$$P_F[(\hat{F}_t, \hat{\mu}_t) \in K_\alpha^c] \le P_F[\hat{F}_t \in A_\alpha^c] + P_F[\hat{\mu}_t \in B_\alpha^c].$$

Combining this inequality with (30) we see that the laws of $(\hat{F}_t, \hat{\mu}_t)$ are exponentially tight. ■ 
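The final step uses the fact that, on the exponential scale, a sum of two probabilities decays at the slower of the two rates: $\frac{1}{t}\log(\mathrm{e}^{-at} + \mathrm{e}^{-bt}) \to -\min\{a, b\}$. The sketch below illustrates this numerically; the function name `normalized_log_sum` and the rates `a = 2`, `b = 3` are illustrative assumptions and are not taken from the paper.

```python
import math


def normalized_log_sum(a, b, t):
    """Return (1/t) * log(exp(-a*t) + exp(-b*t)), computed stably.

    The factor exp(-min(a, b) * t) is pulled out of the sum so that the
    remaining exponentials are at most 1 and cannot overflow.
    """
    m = min(a, b)
    rest = math.exp(-(a - m) * t) + math.exp(-(b - m) * t)
    return -m + math.log(rest) / t


# Hypothetical decay rates for the two probabilities in the union bound.
a, b = 2.0, 3.0
horizons = [1, 10, 100, 1000]
values = [normalized_log_sum(a, b, t) for t in horizons]
```

Each entry of `values` lies between $-\min\{a,b\}$ and $-\min\{a,b\} + (\log 2)/t$, so the sequence approaches $-2$ as $t$ grows, which is why two bounds of the form (30) combine into a bound of the same form for the product set.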



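B Proof of Theorem 4 

Define events $A_n, B_n, C_n, D_n$ for any $\delta > 0$ as 

$$\begin{aligned}
A_n &= \{\hat{\mu}^*(n) \ge \mu^* - \delta\}, \\
B_n &= \{\hat{\mu}^*(n) < \mu' + \delta\} = \bigcap_{k=1}^{K} \{\hat{\mu}_k(n) < \mu' + \delta\}, \\
C_n &= \bigcup_{k \notin \mathcal{I}_{\mathrm{opt}}} \{\hat{\mu}^*(n) = \hat{\mu}_k(n) \ge \mu' + \delta\}, \\
D_n &= \bigcup_{k \in \mathcal{I}_{\mathrm{opt}}} \{\hat{\mu}^*(n) = \hat{\mu}_k(n) < \mu^* - \delta\}.
\end{aligned}$$

It is easily checked that $A_n \cup B_n \cup C_n \cup D_n$ is the whole sample space. Let $J_n(i)$ denote the event 
that arm $i$ is pulled at the $n$-th round and recall that $J'_n(i)$ is given in Algorithm 1. Then, except for 
the first $2K$ rounds, the event $J_n(i)$ implies that $J'_{n'}(i)$ occurred for some $K+1 \le n' < n$. Therefore 
$T_i(n)$ is bounded as 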

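$$\begin{aligned}
T_i(n) &= 2 + \sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=2K}^{n-1} \{T_i(m) = t \cap J_{m+1}(i)\}\right] \\
&\le 2 + \sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{T_i(m) = t \cap J'_m(i)\}\right] \\
&\le 2 + \sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{T_i(m) = t \cap J'_m(i) \cap A_m\}\right] + \sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{T_i(m) = t \cap J'_m(i) \cap A_m^c\}\right]
\end{aligned}$$

and we obtain for the last term that 

$$\begin{aligned}
\sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{T_i(m) = t \cap J'_m(i) \cap A_m^c\}\right]
&\le \sum_{m=K+1}^{n-1} \mathbb{1}[A_m^c] \\
&\le \sum_{m=K+1}^{n-1} \mathbb{1}[B_m] + \sum_{m=K+1}^{n-1} \mathbb{1}[C_m] + \sum_{m=K+1}^{n-1} \mathbb{1}[D_m].
\end{aligned}$$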

In the following Lemmas 20-23 we bound the expectations of these summations, and they prove the 
theorem with $2 - 2\mathrm{e}/r < 0$. ■ 

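Lemma 20 Let $i \notin \mathcal{I}_{\mathrm{opt}}$ be arbitrary. If $\epsilon D_{\inf}(F_i, \mu^*) - \delta/(1 - \mu^*) > 0$ then 

$$\mathrm{E}\left[\sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{T_i(m) = t \cap J'_m(i) \cap A_m\}\right]\right] \le \frac{\log n}{(1-\epsilon)(1-r) D_{\inf}(F_i, \mu^*)} + \frac{1}{1 - \mathrm{e}^{-\Lambda^*[\ldots]}}.$$

Here $\Lambda^*[\ldots]$ stands for a rate whose subscript and argument are not legible in this copy. 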



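Lemma 21 If $\delta < \mu^* - \mu'$ then 

$$\mathrm{E}\left[\sum_{m=K+1}^{n-1} \mathbb{1}[B_m]\right] \le \min_{k \in \mathcal{I}_{\mathrm{opt}}} \left\{ \frac{2(1+K)}{1 - \mathrm{e}^{-\Lambda_k^*(\mu'+\delta)}} + \frac{2\mathrm{e}}{r\left(1 - \mathrm{e}^{-r\Lambda_k^*(\mu'+\delta)}\right)^2} \right\} - \frac{2\mathrm{e}}{r}.$$

Lemma 22 

$$\mathrm{E}\left[\sum_{m=K+1}^{n-1} \mathbb{1}[C_m]\right] \le K \sum_{k \notin \mathcal{I}_{\mathrm{opt}}} \frac{1}{1 - \mathrm{e}^{-\Lambda_k^*(\mu'+\delta)}}.$$

Lemma 23 

$$\mathrm{E}\left[\sum_{m=K+1}^{n-1} \mathbb{1}[D_m]\right] \le K \sum_{k \in \mathcal{I}_{\mathrm{opt}}} \frac{1}{1 - \mathrm{e}^{-\Lambda_k^*(\mu^*-\delta)}}.$$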



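Proof of Lemma 20. In the same way as Honda and Takemura (2010, Lemma 15), we obtain 

$$\begin{aligned}
\sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{J'_m(i) \cap T_i(m) = t \cap A_m\}\right]
&\le \sum_{t=2}^{\infty} \mathbb{1}\left[t(1-r) D_{\inf}(\hat{F}_{i,t}, \mu^* - \delta) \le \log n\right] \\
&\le \frac{\log n}{(1-\epsilon)(1-r) D_{\inf}(F_i, \mu^*)} + \sum_{t=1}^{\infty} \mathbb{1}\left[D_{\inf}(\hat{F}_{i,t}, \mu^* - \delta) \le (1-\epsilon) D_{\inf}(F_i, \mu^*)\right].
\end{aligned} \tag{31}$$

Note that it holds from Lemma 7 and Theorem 12 that 

$$\begin{aligned}
P_{F_i}\left[D_{\inf}(\hat{F}_{i,t}, \mu^* - \delta) \le (1-\epsilon) D_{\inf}(F_i, \mu^*)\right]
&\le P_{F_i}\left[D_{\inf}(\hat{F}_{i,t}, \mu^*) - \frac{\delta}{1-\mu^*} \le (1-\epsilon) D_{\inf}(F_i, \mu^*)\right] \\
&\le P_{F_i}\left[D_{\inf}(\hat{F}_{i,t}, \mu^*) \le D_{\inf}(F_i, \mu^*) - \epsilon D_{\inf}(F_i, \mu^*) + \frac{\delta}{1-\mu^*}\right].
\end{aligned} \tag{32}$$

From (31) and (32), we obtain 

$$\begin{aligned}
\mathrm{E}\left[\sum_{t=2}^{\infty} \mathbb{1}\left[\bigcup_{m=K+1}^{n-1} \{T_i(m) = t \cap J'_m(i) \cap A_m\}\right]\right]
&\le \frac{\log n}{(1-\epsilon)(1-r) D_{\inf}(F_i, \mu^*)} + \sum_{t=1}^{\infty} P_{F_i}\left[D_{\inf}(\hat{F}_{i,t}, \mu^* - \delta) \le (1-\epsilon) D_{\inf}(F_i, \mu^*)\right] \\
&\le \frac{\log n}{(1-\epsilon)(1-r) D_{\inf}(F_i, \mu^*)} + \frac{1}{1 - \mathrm{e}^{-\Lambda^*[\ldots]}},
\end{aligned}$$

where $\Lambda^*[\ldots]$ stands for a rate whose subscript and argument are not legible in this copy. 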
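Proof of Lemma 21. First we simply bound $\sum_{m=K+1}^{n-1} \mathbb{1}[B_m]$ by 

$$\sum_{m=K+1}^{n-1} \mathbb{1}[B_m] \le \sum_{t=1}^{\infty} \sum_{m=K+1}^{\infty} \mathbb{1}[B_m \cap T_k(m) = t], \tag{33}$$

where $k \in \mathcal{I}_{\mathrm{opt}}$ is arbitrary. By the same argument as Honda and Takemura (2010, Lemma 16), the 
event $\{D_{\inf}(\hat{F}_{k,t}, \mu' + \delta) \le u \cap \hat{\mu}_{k,t} < \mu' + \delta\}$ implies 

$$\sum_{m=K+1}^{\infty} \mathbb{1}[B_m \cap T_k(m) = t] \le \mathrm{e}^{tu(1-r)} + K.$$

Let $P(u) \equiv P_{F_k}[D_{\inf}(\hat{F}_{k,t}, \mu' + \delta) \ge u \cap \hat{\mu}_{k,t} < \mu' + \delta]$. When we simply write $\Lambda_k^*$ for $\Lambda_k^*(\mu' + \delta)$ 
given in (6), it holds from Theorem 11 that 

$$\begin{aligned}
\mathrm{E}\left[\sum_{m=K+1}^{\infty} \mathbb{1}[B_m \cap T_k(m) = t]\right]
&\le \int_0^{\infty} \left(\mathrm{e}^{tu(1-r)} + K\right) \mathrm{d}(-P(u)) \\
&= \left[-\left(\mathrm{e}^{tu(1-r)} + K\right) P(u)\right]_0^{\infty} + t(1-r) \int_0^{\infty} \mathrm{e}^{tu(1-r)} P(u)\,\mathrm{d}u \\
&\le 2(1+K)\mathrm{e}^{-t\Lambda_k^*} + 2t(1-r) \int_0^{\Lambda_k^*} \mathrm{e}^{-t(\Lambda_k^* - (1-r)u)}\,\mathrm{d}u + 2\mathrm{e}\,t(1-r)(1+t) \int_{\Lambda_k^*}^{\infty} \mathrm{e}^{-tru}\,\mathrm{d}u \\
&\le 2(1+K)\mathrm{e}^{-t\Lambda_k^*} + 2\mathrm{e}^{-tr\Lambda_k^*} + \frac{2\mathrm{e}(1-r)}{r}(1+t)\mathrm{e}^{-tr\Lambda_k^*} \\
&\le 2(1+K)\mathrm{e}^{-t\Lambda_k^*} + \frac{2\mathrm{e}}{r}(1+t)\mathrm{e}^{-tr\Lambda_k^*}.
\end{aligned}$$

Taking the summation over $t$ with the formula $\sum_{t=1}^{\infty} (1+t)x^t = (1-x)^{-2} - 1$ for $|x| < 1$, 
we obtain from (33) that 

$$\mathrm{E}\left[\sum_{m=K+1}^{n-1} \mathbb{1}[B_m]\right] \le \frac{2(1+K)}{1 - \mathrm{e}^{-\Lambda_k^*}} + \frac{2\mathrm{e}}{r\left(1 - \mathrm{e}^{-r\Lambda_k^*}\right)^2} - \frac{2\mathrm{e}}{r}.$$

We complete the proof by taking $k \in \mathcal{I}_{\mathrm{opt}}$ such that the right-hand side is minimized. 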
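The summation over $t$ in this proof can be sanity-checked numerically. The sketch below assumes that the per-$t$ bound has the form $2(1+K)\mathrm{e}^{-t\Lambda} + (2\mathrm{e}/r)(1+t)\mathrm{e}^{-tr\Lambda}$ and that the summed bound is $2(1+K)/(1-\mathrm{e}^{-\Lambda}) + 2\mathrm{e}/(r(1-\mathrm{e}^{-r\Lambda})^2) - 2\mathrm{e}/r$, which is how the damaged displays are read here; the values of `K`, `r` and `lam` are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def series_one_plus_t(x, terms=2000):
    """Partial sum of sum_{t>=1} (1 + t) * x**t for 0 < x < 1."""
    return sum((1 + t) * x**t for t in range(1, terms + 1))


def series_closed_form(x):
    """Closed form 1/(1-x)^2 - 1 of the series above."""
    return 1.0 / (1.0 - x) ** 2 - 1.0


def per_t_bound(t, K, r, lam):
    """Assumed per-t bound 2(1+K)e^{-t lam} + (2e/r)(1+t)e^{-t r lam}."""
    return (2 * (1 + K) * math.exp(-t * lam)
            + (2 * math.e / r) * (1 + t) * math.exp(-t * r * lam))


def summed_bound(K, r, lam):
    """Assumed closed-form bound after summing over t >= 1."""
    return (2 * (1 + K) / (1 - math.exp(-lam))
            + 2 * math.e / (r * (1 - math.exp(-r * lam)) ** 2)
            - 2 * math.e / r)


# Illustrative parameters (assumptions, not values from the paper).
K, r, lam = 3, 0.5, 0.4
total = sum(per_t_bound(t, K, r, lam) for t in range(1, 5001))
closed = summed_bound(K, r, lam)
```

With these parameters `total` stays below `closed`; the slack comes from replacing $\sum_{t\ge 1}\mathrm{e}^{-t\Lambda} = \mathrm{e}^{-\Lambda}/(1-\mathrm{e}^{-\Lambda})$ by the simpler $1/(1-\mathrm{e}^{-\Lambda})$.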
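Proof of Lemmas 22 and 23. We obtain from the definition of $C_n$ that 

$$\begin{aligned}
\sum_{m=K+1}^{n-1} \mathbb{1}[C_m]
&\le \sum_{k \notin \mathcal{I}_{\mathrm{opt}}} \sum_{m=K+1}^{n-1} \mathbb{1}[\hat{\mu}^*(m) = \hat{\mu}_k(m) \ge \mu' + \delta] \\
&\le \sum_{k \notin \mathcal{I}_{\mathrm{opt}}} \sum_{t=1}^{\infty} \sum_{m=K+1}^{\infty} \mathbb{1}[\hat{\mu}^*(m) = \hat{\mu}_{k,t} \ge \mu' + \delta \cap T_k(m) = t].
\end{aligned} \tag{35}$$

By the same argument as Honda and Takemura (2010, Lemma 17), we have 

$$\sum_{m=K+1}^{\infty} \mathbb{1}[\hat{\mu}^*(m) = \hat{\mu}_{k,t} \cap T_k(m) = t] \le K. \tag{36}$$

On the other hand, from Prop. 14 (iii) we have 

$$P_{F_k}[\hat{\mu}_{k,t} \ge \mu' + \delta] \le \mathrm{e}^{-t\Lambda_k^*(\mu'+\delta)}, \tag{37}$$

where $\Lambda_k^*(x)$ is given in (6). Finally we obtain from (35)-(37) that 

$$\mathrm{E}\left[\sum_{m=K+1}^{n-1} \mathbb{1}[C_m]\right] \le K \sum_{k \notin \mathcal{I}_{\mathrm{opt}}} \sum_{t=1}^{\infty} P_{F_k}[\hat{\mu}_{k,t} \ge \mu' + \delta] \le K \sum_{k \notin \mathcal{I}_{\mathrm{opt}}} \frac{1}{1 - \mathrm{e}^{-\Lambda_k^*(\mu'+\delta)}}$$

and Lemma 22 is proved. In the same way, we obtain Lemma 23 from 

$$\mathrm{E}\left[\sum_{m=K+1}^{n-1} \mathbb{1}[D_m]\right] \le K \sum_{k \in \mathcal{I}_{\mathrm{opt}}} \sum_{t=1}^{\infty} \mathrm{e}^{-t\Lambda_k^*(\mu^*-\delta)} \le K \sum_{k \in \mathcal{I}_{\mathrm{opt}}} \frac{1}{1 - \mathrm{e}^{-\Lambda_k^*(\mu^*-\delta)}}.$$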
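The proof of Lemmas 22 and 23 rests on two ingredients: a Chernoff-type bound $P[\hat{\mu}_t \ge x] \le \mathrm{e}^{-t\Lambda^*(x)}$ and the geometric series $\sum_{t\ge 1} \mathrm{e}^{-t\Lambda^*} \le 1/(1 - \mathrm{e}^{-\Lambda^*})$. The sketch below checks both for Bernoulli rewards, where the Cramér rate function is the binary Kullback-Leibler divergence. The Bernoulli model, the parameters `p = 0.3` and `x = 0.5`, and the function names are assumptions made for illustration only; the paper's model is more general.

```python
import math


def bernoulli_rate(x, p):
    """Cramer rate function of Bernoulli(p) at x in (0, 1): KL(x || p)."""
    return x * math.log(x / p) + (1 - x) * math.log((1 - x) / (1 - p))


def upper_tail(t, x, p):
    """Exact P[mean of t i.i.d. Bernoulli(p) samples >= x]."""
    k_min = math.ceil(t * x - 1e-12)  # smallest success count with mean >= x
    return sum(
        math.comb(t, k) * p**k * (1 - p) ** (t - k)
        for k in range(k_min, t + 1)
    )


# Illustrative parameters (assumptions): true mean p, threshold x > p.
p, x = 0.3, 0.5
rate = bernoulli_rate(x, p)

horizons = range(1, 201)
tails = [upper_tail(t, x, p) for t in horizons]
chernoff = [math.exp(-t * rate) for t in horizons]

# Summing the exponential bounds over t gives the geometric-series bound.
geometric_bound = 1.0 / (1.0 - math.exp(-rate))
```

Every entry of `tails` is dominated by the corresponding entry of `chernoff`, and `sum(tails)` is in turn below `geometric_bound`, mirroring the two inequalities used at the end of the proof.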
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